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Abstract: 

Businesses  and  Government  are  investing  heavily  in  their  data  assets.  As  data 
quantities  continue  to  grow  rapidly,  it  is  increasingly  difficult  to  extract  maximum 
value  from  those  data  stores.  Learning  to  predict  the  future  from  past  observations  is 
one  of  the  key  components  that  make  it  possible  to  bring  value  to  data. 

To  date,  much  of  the  research  effort  has  been  devoted  to  drawing  predictions  about  a 
single  pre-defined  target  variable,  such  as  predicting  the  magnitude  of  the  global 
warming,  or  the  probability  of  developing  cancer.  However,  in  many  real-world 
applications,  what  we  wish  to  predict  can  change  dramatically  from  one  instance  to 
the  next,  e.g.  from  one  tactical  situation  to  another  or  from  one  client  to  another. 
State-of-the-art  techniques  only  provide  ad-hoc  solutions  to  this  problem,  because 
learning  one  model  for  every  possibly  encountered  situation  does  not  scale  to  big 
data  assets. 

This  project  investigated  methods  for  learning  a  single  model  that  can  effectively  and 
efficiently  predict  all  unobserved  variables  from  the  currently  available  evidence.  We 
developed  new  technologies  to  learn  models  with  this  property  from  large  and 
high-dimensional  data.  Our  results  show  that  our  techniques  offer  a  gain  of  4  orders 
of  magnitude  in  computation  time  over  the  state  of  the  art. 

1.  Introduction: 

Data  analytics  increasingly  underpin  many  core  processes  in  industry,  commerce, 
governance  and  science.  Data  is  only  a  key  strategic  asset  because  knowledge 
discovery  methods  make  it  valuable.  As  data  quantity  inexorably  rises,  more 
effective  analytic  techniques  that  can  extract  greater  information  from  big  data  will 
add  tremendous  value.  To  date, 
much  big  data  research  has  been 
devoted  to  drawing  predictions 
about  a  single  target  variable:  the 
magnitude  of  global  warming, 
for  example,  or  the  probability 
of  developing  cancer. 

Classification  is  the  task  that  aims  Figure  1:  classification  example 
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at  predicting  the  value  taken  by  a  categorical  variable  from  observations  over  a 
pre-defined  set  of  variables.  In  this  very  mature  field  of  research,  many  algorithms 
have  been  proposed  to  learn  how  to  draw  accurate  predictions  from  data  [1]  -  see 
Figure  1  for  an  example. 

However,  classification  is  limited  to  predicting  a  predefined  and  unique  variable. 

This  project  sought  to  develop  methods  that  leam  how  to  effectively  and  efficiently 
predict  any  set  of  variables  from  any  other  set  of  variables.  This  is  necessary  because, 
in  many  real-world  applications,  what  we  know  and  what  we  wish  to  predict  can 
change  dramatically  from  one  instance  to  the  next,  e.g.  from  one  tactical  situation  to 
another  or  from  one  client  to  another.  Traditional  classification  models  can  be  of 
limited  value  to  data  practitioners  because  they  fail  to  provide  flexibility  to  predict 
whatever  happens  to  be  unobserved  in  any  given  context.  Consider  a  bank  adviser 
meeting  a  new  client  for  the  first  time.  From  one  new  client  to  another,  the  bank 
adviser  might  be  interested  in  predicting  very  different  variables;  from  the  likely 
income  of  the  new  client’s  household,  to  the  likelihood  of  him  or  her  to  be  interested 
in  a  specific  insurance  policy.  Moreover,  the  information  (variables)  that  the  adviser 
might  be  able  to  collect  from  one  new  client  to  the  next  might  be  very  different, 
depending  on  the  course  of  the  conversation.  This  process  is  exemplified  in  Figure  2 
below,  where  the  infonnation  collected  by  the  adviser  are  depicted  in  blue,  while  the 
predicted  values  are  depicted  in  green. 
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Figure  2:  Example  illustrating  the  need  for  more  flexible  learners  -  in  blue:  the 
information  the  adviser  has  acquired  about  his  or  her  new  client  over  the  course 
of  the  conversation;  in  green  the  information  predicted  by  an  ideal  system. 

The  same  needs  arise  in  many  domains  and  situations.  In  systems  supporting  medical 
diagnosis,  general  practitioners  might  be  interested  in  predicting  different  sets  of 
variables  for  different  patients  while  having  different  infonnation  about  their  medical 
condition  and  history.  In  asset  management,  financial  advisers  might  wish  to  study 
the  likelihood  of  different  stocks  gaining  value  after  having  learned  about  any  of 
many  different  types  of  infonnation  about  the  rest  of  the  market). 

We  call  this  task  ‘Omnidirectional  learning’:  being  able  to  learn  from  data  described 
by  a  fixed  set  of  variables  Z,  and  being  able  to  predict  any  subset  of  variables  tcZ, 
from  any  subset  of  the  remaining  ones  X  £  Z\Y. 

Progress  and  limitations  to  date  in  this  field  of  research. 

‘Omnidirectional  learning’  is  not  a  standard  tenn  in  the  literature,  and  actually 
corresponds  to  a  task  that  has  only  previously  received  limited  attention  and  failed  to 
be  recognized  as  a  distinct  task  requiring  specialized  methods.  We  now  review  the 
state  of  the  research  literature  in  the  techniques  that  are  related  to  this  task,  and 
explain  how  no  existing  algorithm  can  consistently  fulfill  the  function  of 
omnidirectional  learner. 

Over  the  last  decades,  there  has  been  significant  research  interest  in  learning 
classification  models  from  data.  Classification  learners  aim  at  estimating  the 
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probability  of  a  target  variable  y  (the  class),  given  the  values  taken  by  a  set  X  of  M 
variables x  =  {%, i.e.  constructing  a  model  of  p(y \x).  Since  the  1950s,  a  large 
variety  of  classification  algorithms  and  strategies  have  been  studied,  with  different 
properties,  behaviors  and  cases  they  are  particularly  suitable:  from  Naive  Bayes, 
logistic  regression  [5]  and  decision  trees,  to  SVMs  [14]  and  Random  Forest  [15].  The 
reader  can  refer  to  [1]  for  a  recent  review  of  the  main  classification  algorithms.  A 
naive  way  to  address  our  omnidirectional  task  would  thus  be  to  utilize  state-of-the-art 
classification  algorithms,  and  compose  a  solution  that  leams  to  predict  every  possible 
variable  from  every  possible  subset  of  the  remaining  ones. 

There  is  a  first  functional  objection  to  the  use  of  independent  classifiers  for 
omnidirectional  learning:  predicting  every  target  variable  of  Y  independently 
completely  ignores  the  complex  dependencies  that  might  exist  between  them.  For 
example,  number  of  children  and  number  of  bedrooms  are  interdependent,  and  so 
predicting  them  separately  will  inevitably  lead  to  important  inaccuracies. 

The  second  objection  is  functional:  using  independent  classifiers  for  omnidirectional 
learning  is  impossible  for  most  datasets,  because  this  requires  learning  a  different 
model  for  every  possible  target  variable,  i.e.  2M  models.1  The  only  way  to  avoid 
having  to  build  an  exponential  number  of  models  would  be  to  use  models  that  can 
handle  missing  values.  That  would  make  it  possible  to  build  a  single  model  per  target 
variable  and  treat  the  other  unobserved  variables  during  classification  as  unknown. 
Three  main  strategies  have  been  identified  to  deal  with  missing  values  at  prediction 
time  [2]: 

1.  Discard  instances',  discarding  instances  with  missing  values.  This  goes  against  the 
aim  of  omnidirectional  learning,  because  all  instances  will  have  unobserved 
variables  and  hence  be  discarded. 

2.  Imputation :  estimating  the  value  or  distribution  of  the  unobserved  variables,  which 
would  produce  a  typical  chicken-and-egg  problem  for  omnidirectional  learning, 
because  if  variables  X1  and  X2  are  unobserved,  we  would  need  to  estimate  X2  to 
predict  Xx  and  vice  versa. 

3.  Reduced-feature  Models:  using  a  different  model,  constructed  to  contain  only  the 
observed  variables  of  test  instance.  There  would  then  be  three  possible  strategies: 

a.  Leam  all  the  possible  models  in  advance,  which  we  have  shown  above  to 
be  unfeasible  for  datasets  with  more  than  a  dozen  variables. 

b.  Leam  the  whole  model  at  classification  time  -  a  strategy  known  as  “lazy 
classification”  [3]  -  for  which  the  time  required  to  perfonn  the 
classification  would  be  incompatible  with  most  applications. 

c.  Marginalize  over  the  unobserved  variables  of  the  learned  model,  which  is 
only  feasible  for  either  simple  models  like  Naive  Bayes  -  which  will  often 
not  provide  competitive  predictions  -  or  if  only  a  few  variables  are 
unobserved."  That  is  contrary  to  the  specifications  of  our  omnidirectional 
framework. 


1  With  the  target  variable  fixed,  each  remaining  variable  has  to  be  either  evidence/obsen’ed,  or 
missing,  which  leads  to  2M  possible  models. 

2  In  the  general  case,  marginalizing  is  exponential  with  the  number  of  variables  over  which  the 
marginalization  is  performed. 
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Thus,  composing  an  omnidirectional  learner  from  existing  classifiers  would  be  both 
unsound  and  unfeasible. 

Graphical  models  [4]  constitute  a  more  consistent  way  to  target 
omnidirectional  learning,  because: 

1 .  They  are  models  of  the  joint  probability  with  no  particular  class,  and  can  thus 
model  complex  dependencies  between  the  variables. 

2.  They  have  efficient  inference  algorithms  that  make  it  possible  to  marginalize 
the  joint  probability  over  any  set  of  variables.  This  would  make  it  possible  to 
compute  the  probability  (or  belief)  over  our  set  Y  of  unobserved  variables, 
conditioned  on  some  evidence  over  the  remaining  set  of  variables  (X). 


However,  using  graphical  models  for  omnidirectional  learning  faces  three  main 
scientific  obstacles: 

Obstacle  # 1 :  How  to  efficiently  marginalization  over  any  set  of  variables? 
Echoing  our  discussion  about  the  treatment  of  missing  values  for  classification, 
making  predictions  from  models  of  the  joint  probability  requires  marginalization. 
When  a  single  variable  is  the  target  of  the  prediction  and  all  values  of  the  other 
variables  are  known,  the  conditional  probability  is  obtained  by  marginalization  over 
the  target: 


P(zi\  z2,-,zM)  = 


p(zltz2,- ,zm) 


ZvtDomtZOpiv,  z2  <ZM ) 

However,  in  the  general  case,  such  marginalization  is  only  possible  for  one  target 
variable;  while  omnidirectional  learning  requires  being  able  to  get  a  prediction  about 
several  (and  potentially  hundreds)  of  variables.  As  we  have  described  above,  this 
process  is  called  inference  or  belief  propagation  for  graphical  models.  It  works  by 
iteratively  updating  local  probabilities  depending  on  local  neighborhoods  of  the 
graph,  until  convergence  of  the  marginal  probabilities.  However,  for  general 
graphical  models,  only  approximate  algorithms  exist  (e.g.  loopy  belief  propagation), 
for  which  convergence  to  actual  marginals  is  uncertain  [6]. 

Obstacle  #2:  How  to  learn  from  large  and  high-dimensional  datasets? 
Learning  graphical  models  from  data  has  been  of  major  interest  since  the  1990s  with 
various  methods  proposed  for  log-linear  models  [5],  Bayesian  networks  [7,8]  and 
Markov  Random  Fields  [9,10],  However,  to  the  best  of  our  knowledge,  other  than 
our  own  work  in  the  area  [12,13],  no  state-of-the-art  method  can  efficiently  learn 
from  datasets  with  more  than  about  50  variables  without  making  strong 
assumptions  about  the  distribution  from  which  the  data  is  drawn.  This  is,  for  example, 
the  case  for  methods  using  1 1 -regularizers  [9,25]  which  assume  that  every  variable 
will  interact  with  a  very  low  number  of  variables,  and  for  the  PC/IC  algorithms  [26] 
which  assume  that  the  conditional  independencies  can  be  discovered  with  reduced 
subsets  of  variables.  These  assumptions  lower  the  accuracy  and  reliability  of  the 
results. 


In  this  project,  we  have  focused  on  effectively  and  effectively  learn  junction 
tree  models.  Junction  trees  are  a  perfect  class  of  graphical  models  for 
omnidirectional  learning  because  they  allow  for  efficient  and  exact  marginalization 
over  any  set  of  variables,  which  directly  solves  Obstacle  #1.  This  project  focused  on 


3  See  for  example  the  running  times  in  hours  for  most  state-of-the-art  methods  on  datasets  with  no 
more  than  50  variables  in  [10]. 
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the  learning  of  the  structure  of  junction  tree  models  for  data  assets  with  1,000+ 
variables. 

2.  Our  approach:  Prioritized  Chordalysis 

In  a  series  of  papers  published  in  2013  and  2014  [12,13],  we  had  shown  how  to 
effectively  learn  junction  tree  models  (also  known  as  decomposable  models)  from 
data  with  medium  dimensionality  of  up  to  100  variables.  However,  further  scalability 
remained  limited,  because  state-of-the-art  techniques  have  to  examine  every  edge  at 
every  step  of  the  search.  For  example,  learning  a  junction-tree  with  the 
state-of-the-art  for  high-dimensional  data  requires  more  than  3  days  for  a  dataset 
with  700  variables  (see  our  experiments  on  the  Protein  dataset  below). 

The  technology  and  algorithms  developed  in  this  project  are  based  on  the  idea  that,  at 
every  step  of  a  forward  search  for  the  best  graphical  model,  it  is  only  necessary  to 
reconsider  a  subset  of  edges  for  addition  to  the  successively  refined  models.  Let  us 
motivate  this  idea  with 
another  real-world  dataset 
representing  30,000  news 
articles  described  over  500 
variables  (see  the  description 
of  dataset  ‘ABC  News’  in  the 
experiments  below  for  more 
details).  On  the  one  hand,  we 
recorded  how  many  edges  are 
examined  by  the  LLA  process. 

We  report  this  number  over 
the  course  of  the  LLA  process 
in  the  top  curve  in  figure  on 
the  right:  the  process  examines  the  addition  of  more  than  10,000,000  edges.  On  the 
other  hand,  we  looked  at  how  many  edges  actually  lead  to  the  same  evaluation  of  the 
model  between  successive  steps  of  the  search.  We  report  the  difference  -  i.e.  the 
number  of  times  that  edges  need  to  be  re-examined  after  the  very  first  step  -  in  the 
bottom  curve  of  the  figure  above:  only  about  10,000  edges'  additions  require 
re-examination.  Note  that  the  remainder  of  this  report  will  make  it  clear  how  this 
graph  can  be  generated. 

This  means  that  the  vast  majority  of  the  computation  could  be  avoided  if  we  knew 
which  edges  would  lead  to  the  same  evaluation  of  the  model. 

This  observation  is,  quite  simply,  what  the  technologies  that  we  developed  aim  at 
leveraging  on:  showing  how  to  exactly  predict  that  an  edge  will  need  to  be 
re-examined,  and  designing  an  algorithm  that  utilizes  that  knowledge  to  learn 
junction- tree  models  several  orders  of  magnitude  faster  than  the  current 
state-of-the-art  methods. 

Our  experiments  on  real-world  datasets  with  up  to  2,000  variables  show  that  our 
algorithm,  Prioritized  Chordalysis,  can  search  for  a  junction-tree  model  about  4 
orders  of  magnitude  faster  than  state-of-the-art  techniques,  without  making  any 
additional  assumption. 
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Let  us  motivate  over  a  few  simple  examples  why  only  a  very  limited  number  of 
edges  need  to  be  re-examined  at  each 
step  of  the  search. 

•  Case  1:  disconnected 

components.  Consider  the 
model  of  a  joint  distribution 
over  four  variables  (age  -  a, 
height  -  h,  gender  -  g  and 
cholesterol  -  c)  illustrated 
figure  (a)  on  the  right. 

Starting  with  a  model 
considering  that  the  4 
variables  are  independent,  the 
first  step  consists  of  finding 
which  one  of  the  6  edges  will 
result  in  the  best  model.  To  this  end,  the  addition  of  every  single  edge  is 
evaluated.  Let  us  assume  that  this  model  is  the  one  including  edge  {a,h},  i.e. 
including  the  correlation  between  age  and  height.  The  second  step  is  then 
going  to  assess  the  addition  of  every  single  edge  again.  The  score4  associated 
with  the  addition  of  edge  {g,c} remains  identical,  regardless  of  it  being  added 
at  the  first  or  second  step,  because  associated  variables  are  not  in  the  same 
connected  components  of  the  graph,  and  hence  not  “interacting”  in  the  model; 
cholesterol  and  gender  are  independent  of  age  and  height.  As  a  result,  this 
edge  need  not  be  evaluated  again  at  the  second  step. 

•  Case  2:  empty  minimal  separator.  Consider  now  figure  (b)  that  results  from 
including  the  interaction  {h,g}  at  step  2.  The  third  step  is  then  going  to 
examine  the  addition  of  every  remaining  edge  again.  The  score  associated 
with  the  addition  of  edge  {g,c}  is  identical,  regardless  of  it  being  added  at  the 
first,  second  or  third  step,  because  adding  {g,c}  will  “explain”  the  same 
quantity  of  infonnation  in  all  three  models;  cholesterol  is  independent  of  age 
and  height  given  gender.  We  will  see  that  this  is  due  to  an  empty  minimal 
separator  between  g  and  c:  S{gc}  —  0,  i.e.  there  is  no  vertex  to  remove  from 
the  graph  to  disconnect  g  from  c.  As  a  result,  this  edge  need  not  be  evaluated 
again  at  the  second  and  third  step. 

•  Case  3:  identical  minimal  separator.  Consider  the  more  elaborate  model  over 
9  variables  illustrated  in  figure  (c)  above,  where  the  numbers  on  the  edges 
indicate  the  steps  at  which  they  were  added.  We  show  that  from  step  3,  the 
addition  of  edge  {f,g}  to  any  successively  refined  model  need  not  be 
evaluated  again  and  that  the  score  of  adding  {f,g}  will  remain  invariant.  This 
is  motivated  by  the  fact  that,  from  step  3  on-wards,  removing  the  vertex  e 
disconnects  f  from  g  (Sijg]  —  {e}).  In  consequence,  the  last  time  that  the 
addition  of  this  edge  needs  to  be  evaluated  is  at  step  3. 

We  will  now  prove  the  validity  of  these  intuitions.  It  is  interesting  to  observe  that 
being  able  to  tell  if  an  edge  has  to  be  re-evaluated  is  not  sufficient,  because  the 


4  Note  that  our  observations  are  valid  for  different  scores  such  as  p-value  computed  from 
log-likelihood  ratios,  Kulbach  Leibler  divergence  and  MDL/MML  scores.  In  the  remainder  of  this 
report,  we  focus  on  p-value. 
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search  process  will  still  enumerate  over  all  the  edges  at  every  step.  This  enumeration 
prevents  us  from  scaling  to  datasets  with  thousands  of  variables,  because  there  are 
0(M2)  M  variables.  We  will  show  that  Prioritized  Chordalysis  can  precisely 
identify  the  edges  that  have  to  be  re-evaluated,  and  use  this  infonnation  to  maintain  a 
data  structure  that  makes  it  possible  avoid  such  enumeration. 

2.1  What  edges  require  re-examination? 

We  have  shown  in  [12]  that  computing  the  statistical  significance  (p-value)  of 
replacing  a  current  reference  model  M*  by  a  candidate  model  Mc  requires  two 
elements:  the  difference  in  the  fit  C2  and  the  difference  in  the  complexity  dfr.  We 
now  develop  these  elements  for  our  target  class  of  models,  i.e.  junction  tree  models. 
Definition  1:  Let  G  —  [V,  E}  be  an  undirected  graph  and  two  vertices  a,bEV. 
The  set  of  vertices  S  c  V  is  a  (a,b)-separator  if  removing  S  from  G  separates  the 
vertices  a  and  b  into  different  connected  components.  If  no  proper  subset  of  S  is  an 
(a,b)-separator,  then  S  is  a  minimal  (a-b)-separator,  noted  $ab- 
Moreover,  we  have  recently  shown  that: 

Theorem  1  [12]:  If  two  decomposable  models  Mc  c  M*  differ  only  in  one  edge 
(a,b),  then: 

G2r  =  2  ■  N(K(SabU  {a})  +H(SabU{6}) 

—  H  (Sab  U  {a,  b})  —  H(Sab)) 
where  H(.)  denotes  the  entropy. 

We  can  thus  formulate  the  following  theorem: 

Theorem  2:  Let  Ml  and  Ml  be  two  reference  models  selected  at  different  steps  of 
the  search,  Gl  —  { V ,  El }  and  G*z  —  {V,  El }  their  associated  graphs,  and  a,  b  two 
verticies  such  that  there  is  no  edge  between  a  and  b  in  either  models  and  Gl  U  [a,  b } 
and  Gl  U  [a,  b]  are  both  chordal  graphs  (i.e.  adding  { a,b}  to  either  graphs  keeps  the 
models  in  the  class  of  junction-tree  models).  If  S  is  a  minimal  (a,b)-separator  in  Gl 
and  Gl  (S*^  =  S*§),  then  the  p-value  associated  with  the  addition  of  {a, b}  to  Ml  is 
identical  to  the  p-value  associated  with  the  addition  of  {a,b}  to  Ml . 

Proof.  Direct  from  simplification  of  G 2  with  S^b  =  Slit  ■  ■ 

A  direct  consequence  of  this  theorem  is  that  the  p-value  associated  with  the  addition 
of  an  edge  only  has  to  be  re-evaluated  between  two  steps  of  LLA  if  its  minimal 
separator  changes  between  these  steps.  The  possible  gain  in  computation  then 
depends  upon  how  frequently  do  minimal  separators  actually  change  between 
successive  steps  of  the  search.  This  obviously  depends  on  the  underlying  structure  of 
the  dataset;  we  can  however  bound  the  maximum  number  of  edges  that  will  change 
between  two  steps  of  the  search. 

Theorem  3:  The  number  of  edges  that  need  to  be  re-examined  after  adding  edge  (a,b) 
to  the  current  reference  model  is  at  most  2(M  —  1)  —  |  N  (a)  \  —  |/V(b)  |,  where  N  (x) 
designates  the  neighbours  of  x,  i.e.  only  O(M)  edges  require  re-examination  at 
every  step. 

Proof.  Adding  (a,b)  to  a  chordal  graph  results  in  the  addition  of  only  one  maximal 
clique:  Cab  =  Sab  U  a  U  b  [32,  Section  3.2.1],  Any  new  edge  added  to  the 
clique-graph  of  the  graph  will  have  Cab  as  one  of  its  endpoints  [32,  Theorem  4.3] 
(note  that  we  use  the  term  “clique-graph”  as  defined  in  [31]).  It  results  that  any  edge 
impacted  by  the  addition  of  (a,b)  has  either  the  form  (a,x)  or  (b,x)  [32,  Proof  to 
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Theorem  4.3].  Given  that  a  can  at  most  be  connected  to  M  —  1  vertices  and  that  it  is 
already  connected  to  |lV(a)|  of  them,  there  are  at  most  M  —  1  —  |lV(a)|  edges  of 
the  form  (a,x).  Similar  reasoning  for  b.  ■ 

This  fundamental  result  establishes  that,  in  the  worst-case  scenario,  only  O(M) 
edges  have  to  be  re-examined  at  each  step.  This  strongly  contrasts  with  the 
state-of-the-art  techniques  that  require  examination  of  all  0(M2)  possible  edges. 

2.2  How  to  select  all  edges  that  need  to  be  re-examined? 

We  have  shown  in  the  last  subsection  that  an  edge  needs  to  be  re-examined  between 
two  steps  of  the  search  if  and  only  if  the  associated  minimal  separator  has  changed 
between  these  two  steps.  The  naive  way  to  select  all  the  edges  that  need  to  be 
re-examined  at  every  new  step  would  then  be  to  iterate  over  all  edges  (a,b)  and 
select  those  for  which  the  minimal  separator  has  changed.  However,  we  have  seen 
that  iterating  over  all  possible  edges  at  every  step  of  the  search  is  precisely  the 
limiting  factor  to  scale  up  to  datasets  with  thousands  of  variables.  Furthermore,  even 
this  naive  selection  would  require  prohibitive  calculations,  because  finding  all  Sab 
itself  requires  0(\V\  +  \E\)  operations  for  chordal  graphs  [33].  In  this  subsection, 
we  show  how  both  these  problems  can  be  solved  using  an  advanced  graph-theoretical 
data  structure  -  the  clique-graph  [31]: 

1.  We  demonstrate  that,  for  all  edges  (a,b)$  that  are  considered  for  addition  to 
successive  reference  models  M*  ,  their  minimal  (a,b)-separators  can  be 
efficiently  derived  from  the  clique-graph. 

2.  We  take  an  existing  algorithm  that  aims  at  maintaining  the  clique-graph  data 
structure  when  iteratively  adding  edges  to  the  supporting  graph  [32],  and 
show  how  to  modify  it  to  keep  track  of  all  minimal  (a,b)-separators. 

2.2.1.  Minimal  vertex  separators  align  with  edges  of  the  clique-graph 

The  clique-graph  structure  is  an  ideal  base-structure  for  our  task  of  keeping  track  of 
all  minimal  separators  between  the  vertices. 

Definition  2  [31,  Definition  2]:  Let  G  be  a  chordal  graph.  The  clique-graph 
C(G)  =  {VC,EC}  is  defined  by: 

•  Vc  is  the  set  of  maximal  cliques  of  G 

•  (C1;  Cf)  belongs  to  Ec  iff  C1  fl  C2  is  a  minimal  (a,b)-separator  for  each 
a  E  C1\C2  and  each  b  E  C2\C1 

We  now  formulate  the  theorem  that  is  the  base  for  tracking  the  minimal  separators. 
Theorem  4:  If  (a,b)  can  be  added  to  a  chordal  graph  G  while  maintaining  its 
chordality,  then  Sah  =  Ca  fl  Cb  where  (Ca,  Ch)  E  Ec,  a  E  Ca  and  b  E  Cb. 

Proof.  If  adding  (a,b)  maintains  the  chordality  of  G  then  3  (Ca,  Ch)  in  C (G)  such 
as  a  E  Ca  and  b  E  Cb  [32,  Lemma  3.1].  By  Definition  2,  if  (Ca,  Cb)  E  Ec,  then 
Ca  fl  Cb  is  a  minimal  (a,b)-separator.  ■ 

2.2.2.  Minimal  vertex  separators  align  with  edges  of  the  clique-graph 

We  have  demonstrated  in  Theorem  4  that  for  all  edges  (a,b),  the  minimal 
(a,b)-separator  Sab  can  be  obtained  from  edges  (Ca,  Cb)  of  the  clique-graph,  such 
as  a  E  Ca  and  b  E  Cb.  A  naive  way  of  keeping  track  of  all  the  minimal  separators 
could  thus  be  to  iterate  over  the  edges  Clt  C2)  of  the  clique-graph,  and  for  each  one 
of  them,  to  iterate  over  all  pairs  of  vertices  (x,y),x  G  C1\C1  fl  C2,y  E  C2\C2  fl 
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and  memorize  that  Sxy  =  C1  fi  C2.  This  would  however  lead  to  a  vast  amount  of 
unnecessary  computations,  because  most  of  the  structure  of  the  clique-graph  remains 
unchanged  when  adding  an  edge  to  the  associated  (normal)  graph. 

We  refine  here  the  state-of-the-art  algorithm  for  the  iterative  update  of  clique-graphs 
[32]  in  order  to  keep  track  of  all  minimal  vertex  separators.  Note  that  we  only  detail 
the  modified  parts  of  [32]  ’s  algorithm.5  The  theoretical  contribution  of  this  part  of 
the  paper  concerns  -  a  boolean  MxM-matrix  informing  about  the  eligibility  of 
any  edge  for  addition  to  the  current  graph  -  and  its  iterative  update: 

1.  We  make  £^  a  function  that  associates  to  any  pair  of  vertices  (a,b)  its 
eligibility,  its  minimal  separator  Sab  and  the  clique-graph  edge  (Ca,  Cb)  G 
Ec  such  that  a  G  Ca,  b  G  Cb  and  Ca  fl  Cb  —  Sab,  i-e-  the  two  nodes  of  the 
clique-graph  allowing  a  to  be  connected  to  b  in  G. 

2.  Following  our  Theorem  4,  every  time  a  new  edge  ( C' ,  Cab )  is  added  to  the 
clique-graph  as  a  result  of  adding  (a,b)  to  the  graph,  we  set  £^(x,  a)  to 
(true,  C'  n  Cab,C',C  ab)  for  all  (x,a)  such  as  x  G  C'^Cab ,  a  e  Cab\C  ■ 
Similarly  for  b. 

3.  Every  time  an  edge  (C1C2)  is  deleted  from  C(G)  as  a  result  of  adding  (a, b) 
to  G  and  noting  that  such  (C1C2)  will  follow  Cx  fl  C2  —  Sab ,  we  set 
£^(x,  a)  to  (false,  _)  for  all  pairs  (x,y)  such  that  x  (resp.  y)  is  in  the 
same  connected  component  as  a  (resp.  b)  in  G  —  Sab. 

In  addition,  note  that  the  scientific  community  has  challenged  the  correctness  of 
[32] ’s  algorithm,  in  particular  for  the  case  where  G  is  made  of  several  connected 
components  [34]  which  leads  to  empty  minimal  separators.  We  attribute  this  to  a  few 
unfortunate  typos  present  in  [32],  to  the  use  of  an  imprecise  vocabulary,6  and  to  the 
absence  of  any  available  implementation  of  the  algorithm.  We  have  clarified, 
corrected  and  extended  [32]’s  algorithm.  Our  algorithm  can  easily  been  reversed  back 
to  the  original  algorithm  by  only  considering  the  boolean  values  in  £^(x,  a).  Note 
that  the  validity  of  our  implementation  has  been  carefully  checked  and  tested  over 
hundreds  of  experiments,  where  we  verified  that  it  led  to  the  same  results  as 
algorithms  which  do  not  make  use  of  the  clique-graph  [35]. 

2.3  Efficiently  iterating  over  the  best  edges 

This  subsection  describes  the  last  component  of  our  algorithm:  how  to  prevent 
enumeration  over  all  possible  edges  at  every  step. 

At  every  step,  the  standard  model-search  frameworks  consider  all  the  possible 
modifications  of  the  current  reference  model.  This  requires  iteration  over  all  0(M2) 
possible  edges,  which  is  the  limiting  factor  to  perfonn  the  search  for  datasets  with 
thousands  of  variables.  As  there  are  at  most  (^)  steps,  state-of-the-art  algorithms 
can  all  lead  to  the  examination  of  0(M4)  edges. 


5  The  reader  can  refer  to  the  original  paper  and  to  our  implementation  available  at 
http://qithub.com/fpetitiean/chordalvsis  for  more  details 

6  An  example  is  the  use  of  "connected"  which  can  be  interpreted  as  the  presence  of  a 
direct  edge  between  two  vertices,  or  as  the  existence  of  a  path  connecting  these 
vertices. 
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Our  Prioritized-Chordalysis  approach  uses  a  priority  queue  to  store  the  edges  that 
have  to  be  successively  considered  for  addition  to  the  current  reference  model.  We 
keep  the  edges  ordered  by  their  associated  statistical  significance.  As  we  have  seen  in 
Section  2.1,  if  the  minimal  separator  associated  with  an  edge  does  not  change  from 
one  step  to  another,  neither  does  the  statistical  significance  associated  with  this  edge. 
This  means  that,  at  every  step,  the  only  edges  that  are  going  to  change  in  the  queue, 
are  1)  the  edges  that  are  not  eligible  anymore  because  they  would  not  keep  the  graph 
chordal,  2)  the  edges  that  are  newly  eligible  and  3)  the  edges  that  have  had  a  change 
of  minimal  separator. 

We  have  shown  in  Section  2.2  that  such  changes  are  all  associated  with  the  addition 
and  deletion  of  edges  in  the  clique-graph:  adding  a  clique-graph  edge  enables  new 
edges  (or  change  their  minimal  separator)  while  removing  a  clique-graph  edge 
disables  edges.  To  keep  the  explanation  simple,  and  because  we  will  see  that  this 
does  not  change  the  overall  complexity,  we  consider  a  priority  queue  based  on  a  heap 
data  structure,  with  retrieval  and  removal  of  the  minimum  in  0(1)  ,  and 
insertion/deletion  of  an  element  in  O(logn).  We  now  prove  that,  even  in  the  worst 
case,  our  solution  exhibits  a  far  better  complexity  than  state-of-the-art  methods. 

Initialization.  At  the  start,  all  pairs  of  edges  are  sorted  and  added  to  the  queue, 
which  requires  0(M2log(M))  operations. 

Edge  addition.  Any  new  clique-graph  edge  has  Cab  as  its  endpoint  (see  proof  to 
Theorem  3).  In  consequence,  any  edge  impacted  by  the  addition  of  (a,b)  has  either 
the  fonn  (x,a)  or  (x,b).  As  a  (and  b)  cannot  be  connected  to  more  than  M-l  vertices, 
at  every  step  of  the  search,  at  most  2(M  —  1)  edges  might  be  added  to  the  queue; 
resulting  in  a  quasi-linear  complexity  with  the  number  of  variables  for  each  of  the 
0(M2)  possible  steps,  thus  0(M3logM). 

Edge  deletion.  Any  edge  that  is  removed  from  the  priority  queue  has  obviously  to 
have  been  added  to  it.  As  there  are  at  most  0  (M2  log  M  +  M3  log  M)  such  additions, 
there  will  also  be  at  most  0  (M3  log  M)  such  deletions. 

Overall.  For  k  steps  perfonned,  our  algorithm  thus  requires  only  0(kM  log  M) 
operations;  every  step  of  LLA  exhibits  a  quasi-linear  complexity  with  the  number  of 
variables.  This  starkly  contrasts  with  the  quadratic  0(/cM2)  complexity  of 
state-of-the-art  algorithms  [32,34,12,13].  Our  experiments  will  show  that  this 
difference  makes  it  possible  to  gain  efficiency  by  several  orders  of  magnitude  and 
allows  us  to  perfonn  the  search  for  a  statistically  significant  junction-tree  model  for 
datasets  with  thousands  of  variables. 

3.  Experiment: 

We  have  shown  in  Section  2  that  Prioritized  Chordalyis  dominates  the  state  of  the  art 
in  terms  of  algorithmic  complexity.  This  section  seeks  to  demonstrate  its 
computational  superiority  on  real-world  datasets.  Note  that  this  section  does  not  seek 
to  further  assess  the  relevance  of  x2  goodness-of-fit  tests  for  learning  graphical 
models,  because  it  has  long  been  accepted  by  the  community.  Rather,  our 
experiments  seek  to  demonstrate  that  we  can  achieve  further  scalability  without 
sacrificing  the  statistical  soundness  of  the  scoring  methods.  To  this  end,  we  consider 
four  successively  refined  algorithms  for  searching  junction-tree  models,  starting  from 
the  current  state  of  the  art  for  high-dimensional  data  [12]  and  progressively 
incorporating  the  contributions  of  this  paper: 
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Version  1:  We  start  with  Chordalysis:  the  first  method  that  can  perfonn  the  search 
on  high-dimensional  data  [12]. 

Version  2:  We  integrate  the  clique-graph  update  algorithm  from  [32]  into  Version  1. 
Version  3:  We  add  to  Version  2  the  ability  to  keep  track  of  the  minimal 
(a,b)-separators. 

Version  4  —  Prioritized  Chordalysis:  We  add  to  Version  3  the  ability  to  keep  track 
of  the  best  edges  to  be  successively  added  in  a  priority  queue. 

As  we  have  demonstrated  in  Section  2,  the  worst-case  complexity  only 
depends  on  the  number  of  variables.  This  is  a  consequence  of  the  number  of  edges 
depending  on  the  number  of  vertices.  However,  the  number  of  edges  to  be  discovered 
from  data  depends  on  the  actual  dependencies  that  can  be  found  in  data.  If  the  data  is 
drawn  from  a  probability  distribution  where  all  variables  are  actually  independent, 
then  the  process  will  quickly  finish.  In  contrast,  real-world  datasets  often  exhibit 
numerous  high-order  correlations,  leading  to  more  computation  time.  In  addition,  the 
quantity  of  data  has  also  a  significant  impact  on  the  computation  time.  This  can  seem 
counter-intuitive  because  the  scoring  of  an  edge  depends  on  four  entropies  only,  and 
each  entropy  can  be  naively  computed  with  a  quasi-linear  complexity  with  the  size  of 
the  dataset.  However,  increased  data  quantity  allows  more  edges  to  be  identified  as 
statistically  significant  and  will  thus  often  lead  to  a  very  significant  increase  in  the 
computation  time.  This  is  well  exemplified  by  the  toss  of  a  coin  and  the  associated 
decision:  if  we  toss  the  coin  100  times  and  we  observe  51  heads  and  49  tails,  we 
cannot  state  that  the  coin  is  unbalanced.  However,  it  we  toss  it  100,000  times  and 
51,000  heads  and  49,000  tails,  and  while  this  is  the  same  proportion  of  heads/tails, 
statistical  tests  tell  us  that  we  can  confidently  state  that  it  is  very  unlikely  that  the 
coin  is  balanced.  This  phenomenon  is  similar  to  the  one  observed  with  the  learning  of 
decision  trees:  larger  quantities  of  data  will  tend  to  create  deeper  trees.  This  is  why 
we  use  a  broad  range  of  real-world  datasets,  with  both  various  number  of  variables 
and  various  quantities  of  data: 

•  Mushroom:  the  classical  mushroom  dataset,  22  variables,  8k  examples  [36]. 

•  EPESE:  epidemiological  study  of  the  elderly,  25  variables,  14k  examples  [37]. 

•  Internet:  demographic  information  on  internet  users,  70  variables,  10k  examples 
[36]. 

•  CoIL2000:  insurance  customer  management,  86  variables,  6k  examples  [38]. 

•  MITFace:  face  recognition  dataset,  discretized  to  4  bins  using  equal  frequency, 
362  variables,  31k  examples  [39]. 

•  Finance:  stock  performance  of  the  companies  listed  in  the  S&P500  over  20  years 
of  trading,  500  variables. 

•  Protein:  Multiple  alignment  of  the  Serpin  family  of  proteins,  750  variables,  212 
proteins  [42]. 

•  Orphamine:  Frequency  of  occurrence  of  1,260  symptoms  for  2,600  rare  diseases, 
1,260  variables,  2,600  examples  [40]. 

•  ABC:  Use  of  the  500  most  interesting  words  in  all  the  news  articles  about 
Melbourne  published  by  the  Australian  Broadcasting  Network  (ABC),  500 
variables,  35k  examples. 

•  NYT :  Use  of  the  2,000  most  interesting  words  in  10%  of  the  articles  published  by 
the  New  York  Times  from  1987  to  2007,  2,000  variables,  180k  examples  [41]. 

Where  licensing  restrictions  permit  us  to  do  so,  we  have  made  these  datasets 
available  at  http://bit.lv/PrioChordalysisRes. 
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Figure  3  presents  the  computation  time  required  to  perfonn  LLA  for  every  version  of 
the  algorithm  on  these  real-world  datasets.  Note  that  the  graphs  associated  with  each 
dataset  are  provided  at  http  ://b it.lv/PrioChordalysisRes .  These  results  confirm  the 
superiority  of  our  method.  Prioritized  Chordalysis  is  the  fastest  method  for  all 
datasets.  Moreover,  for  all  datasets  with  more  than  100  variables  (from  MIT  Face), 
Prioritized  Chordalysis  performs  the  search  with  about  4  orders  of  magnitude  faster 
than  the  state  of  the  art.  For  example,  for  the  ABC  dataset  -  which  comprises  500 
variables  -  Prioritized  Chordalysis  performs  the  search  in  27  seconds  while 
Chordalysis  (Version  1)  requires  39  hours  (to  obtain  exactly  the  same  result);  this  is 
more  than  a  5,200x  speedup. 

□  Version  1  -  ICDM  2013 

■  Version  2  -  VI  +  clique  graph 

■  Version  3  -  V2  +  keep  track  separators 

■  Version  4  -  Prioritized  Chordalysis 


Mushroom  EPESE  Internet  ColL  2000  MIT  Face  Finance  Protein  Orphamine  ABC  news  NYT 

Figure  3:  Comparison  of  the  computation  time  required  to  perform  a  forward  search  on  various 
real-world  datasets.  "+"  indicates  that  the  computation  did  not  finish  within  10  days  of  computation. 


This  is  a  major  result  that  makes  it  possible  to  tackle  datasets  with  thousands  of 
variables.  For  such  datasets,  our  experiments  indeed  show  that  Prioritized 
Chordalysis  makes  it  possible  to  perform  the  search  in  seconds  or  minutes,  when  the 
state  of  the  art  requires  days.  For  example,  for  the  NYT  dataset  -  which  comprises 
2,000  variables  -  Prioritized  Chordalysis  perfonns  the  search  in  only  3  minutes  while 
Version  1  could  not  provide  any  result  in  10  days  of  computation. 

Furthermore,  we  can  observe  that  all  the  successive  elements  that  we  have 
introduced  in  this  paper  play  a  major  role  in  making  the  search  scalable  to  very 
high-dimensional  datasets.  Each  of  the  contributions  that  we  have  made  in  this  paper 
-  from  providing  a  complete  and  correct  clique-graph-update  algorithm,  to  keeping 
track  of  the  minimal  separators  in  order  to  maintain  the  possible  modifications  in  a 
priority  queue  -  gains  one  to  two  orders  of  magnitude,  depending  on  the 
dimensionality  of  the  dataset,  amount  of  available  evidence,  and  complexity  of  the 
underlying  joint  distribution. 

Finally,  we  examine  the  scalability  of  Prioritized  Chordalysis,  on  a  dataset 
with  increasing  number  of  variables.  The  NYT  dataset  is  a  good  test  bed  for  this  task 
because  1)  it  is  our  biggest  dataset  with  180,000  instances  and  2,000  variables  and  2) 
its  variables  are  ordered  (occurrence  frequency  of  every  word),  which  makes  its 
study  possible  with  an  increasing  number  of  variables;  the  most  frequent  words  first. 
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Figure  4  presents  the  results  of  this  experiment.  We  can  observe  that  our 
proposed  algorithm,  Prioritized  Chordalysis,  greatly  dominates  all  other  methods. 
Moreover,  we  can  see  that  the  magnitude  of  the  improvement  of  Prioritized 
Chordalysis  actually  increases  over  time,  i.e.  the  plots  get  farther  apart  as  the  number 
of  variables  increases. 


Figure  4:  Comparison  of  the  computation  time  required  to  perform  the  search  with  regard  to  the 
number  of  variables  used  -dataset  NYT.  We  limited  the  discovery  to  the  first  100  edges  to  limit  the 
computation  time. 


Interestingly,  we  can  also  see  that  when  the  number  of  variables  increases,  Version  2 
tends  to  perform  as  fast  as  Version  3.  This  is  because  the  time  required  to  find  the 
minimal  separator  of  every  edge  from  the  clique  graph  (Version  2)  becomes 
negligible  relative  to  maintaining  the  structure  of  the  clique  graph.  In  consequence, 
tracking  the  minimal  separators  (Version  3)  tend  to  provide  only  marginal 
improvement  over  Version  2.  Note  however  that  Version  4  (Prioritized  Chordalysis) 
keeps  track  of  the  minimal  separators  to  maintain  the  priority  queue;  this  element  is 
thus  necessary  to  obtain  the  exhibited  improvement. 

4.  Results  and  Discussion: 

Being  able  to  predict  any  variable  from  all  the  available  information  about  a 
system  -  Omnidirectional  learning  -  is  critical  for  many  applications.  In  this  project, 
we  showed  how  junction  trees  are  an  excellent  class  of  models  to  perform  this  task, 
because  they  have  exact  and  fast  marginalization  algorithms  available.  The 
remaining  scientific  lock  was  the  scalability  of  learning  such  models. 

With  the  contributions  of  this  project,  we  have  showed  how  such  models  can 
be  learned  for  a  large  class  of  real-world  data  assets,  and  this  on  a  standard  desktop 
computer  only. 

More  specifically,  we  made  the  following  contributions: 

1 .  We  proved  that  only  a  very  small  subset  of  edges  has  to  be  considered  at  each 
step  of  the  search. 

2.  We  demonstrated  how  to  efficiently  find  this  subset  of  edges. 

3.  We  showed  how  to  efficiently  keep  track  of  the  best  edges  to  be  subsequently 
added  to  the  initial  model. 

Our  experiments,  carried  out  on  real  datasets  with  up  to  2,000  variables,  showed  that 
our  contributions  make  it  possible  to  gain  about  4  orders  of  magnitude  in 
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computation  time,  making  it  possible  to  learn  effective  junction-tree  models  of  joint 
distributions  in  seconds  instead  of  days.  Because  efficient  marginalization  exists  for 
these  models,  our  contributions  make  omnidirectional  learning  feasible  on  any 
standard  desktop  computer  for  virtually  all  datasets  with  up  to  thousands  of  variables. 

The  research  was  published  and  presented  at  the  SIAM  International  Conference  on 
Data  Mining  and  received  Best  Paper  Honorable  mention. 

To  ensure  broad  use  and  uptake  of  the  outcomes  of  this  research  project,  we  have 
released  Prioritized  Chordalysis  open-source  on  Github  at 
http:// github .com/ fpetitj ean / chordalysis.  This  will  allow  industry  and  many  fields  of 
science  to  unlock  additional  value  from  their  new  and  existing  data  assets. 
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